Conceptual


Q1

$$
\begin{aligned}
p(X)&=\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}},\qquad
1-p(X)=\frac{1}{1+e^{\beta_0+\beta_1X}}\\
\frac{p(X)}{1-p(X)}&=e^{\beta_0+\beta_1X}
\end{aligned}
$$

So the logistic representation (4.2) and the odds representation (4.3) are equivalent.
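As a quick numerical check (the coefficient and input values below are arbitrary examples), the two representations agree:

beta0=-1; beta1=2; x=0.7
p=exp(beta0+beta1*x)/(1+exp(beta0+beta1*x))
p/(1-p)              # the odds form
exp(beta0+beta1*x)   # identical, as derived above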

Q2

\(p_k(x)\) is the posterior probability that an observation \(X=x\) belongs to the \(k\)th class, and it is: \[ p_k(x)=\frac{\pi_k\frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{1}{2\sigma^2}(x-\mu_k)^2)}{\sum_{l=1}^K \pi_l \frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{1}{2\sigma^2}(x-\mu_l)^2)} \]

Then we take the log of both sides.

*Notice that the denominator is the same for every class, so we denote it by \(C\).

\[ \log p_k(x)=\log \pi_k+\log\frac{1}{\sqrt{2\pi}\sigma}-\frac{x^2}{2\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+\frac{x\mu_k}{\sigma^2}-\log C \] The second term, the third term, and \(\log C\) are the same for all classes, so maximizing \(\log p_k(x)\) over \(k\) is equivalent to maximizing \(\delta_k(x)=\frac{x\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+\log\pi_k\) in (4.13). Since log is a strictly increasing function, maximizing \(p_k(x)\) in (4.12) is in turn equivalent to maximizing \(\delta_k(x)\) in (4.13).
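A small sanity check in R (the means, priors, \(\sigma\), and \(x\) below are arbitrary example values): the class maximizing the posterior also maximizes \(\delta_k(x)\).

mu=c(-1,0,2); pik=c(0.3,0.5,0.2); sigma=1.5; x=0.8
post=pik*dnorm(x,mu,sigma)                    # proportional to p_k(x)
delta=x*mu/sigma^2-mu^2/(2*sigma^2)+log(pik)  # delta_k(x) from (4.13)
which.max(post)==which.max(delta)             # TRUE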

Q3

We can write the posterior probability of the \(k\)th class: \[ p_k(x)=\frac{\pi_k\frac{1}{\sqrt{2\pi}\sigma_k}\exp(-\frac{1}{2\sigma_k^2}(x-\mu_k)^2)}{\sum_{l=1}^K \pi_l \frac{1}{\sqrt{2\pi}\sigma_l}\exp(-\frac{1}{2\sigma_l^2}(x-\mu_l)^2)} \] Then we take the log as in Q2, and the denominator is still a constant \(C\): \[ \log p_k(x)=\log \pi_k+\log\frac{1}{\sqrt{2\pi}\sigma_k}-\frac{x^2}{2\sigma_k^2}-\frac{\mu_k^2}{2\sigma_k^2}+\frac{x\mu_k}{\sigma_k^2}-\log C \] Notice the third term: its coefficient \(-\frac{1}{2\sigma_k^2}\) now depends on \(k\), so \(\log p_k(x)\) is a quadratic function of \(x\) and the Bayes classifier is not linear.
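A concrete illustration (the means and variances are arbitrary example values): with unequal variances, the difference between the two class densities can vanish at two points, so the boundary cannot be a single linear cutoff.

# Two classes with equal priors but unequal variances
mu=c(0,1); sigma=c(1,3)
f=function(x) dnorm(x,mu[1],sigma[1])-dnorm(x,mu[2],sigma[2])
uniroot(f,c(-10,0))$root  # one boundary point
uniroot(f,c(0,10))$root   # a second boundary point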

Q4

(a)

On average, 10% of the available observations will be used to make the prediction, because the observations are uniformly distributed (ignoring edge effects near 0 and 1).

(b)

On average, when \(p=2\), the fraction is \(\frac{1}{10}\times\frac{1}{10}=\frac{1}{100}\).

(c)

On average, when \(p=100\), the fraction is \((\frac{1}{10})^{100}\).

(d)

Suppose that on each dimension we uniformly divide the range \([0,1]\) into 10 bins of width 0.1. If we want to find near points within 10% of the range of each of \(X_1, X_2, \ldots, X_{100}\) (given \(p=100\)), we need on the order of \(10^{100}\) observations to cover the space. Therefore, as \(p\) increases, the number of observations needed grows exponentially.

(e)

When there are many dimensions, the fraction of observations close enough to be used by KNN becomes vanishingly small. This means that for a fixed sample size, more features lead to fewer usable neighbors, as the simulation below illustrates.
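A rough simulation (the sample size and the \(\pm5\%\) window are illustrative choices):

# Fraction of uniform observations whose every coordinate lies within
# +/-5% of a central test point, for increasing p
set.seed(1)
frac_near=function(p,n=1e5){
  X=matrix(runif(n*p),n,p)
  mean(apply(abs(X-0.5)<=0.05,1,all))
}
sapply(c(1,2,5,10),frac_near)  # about 0.1, 0.01, 1e-5, then ~0 in a finite sample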

(f)

*p=1: side length is \(0.1\).

*p=2: side length is \((0.1)^\frac{1}{2}\approx0.316\).

*p=100: side length is \((0.1)^\frac{1}{100}\approx0.977\).

*In general, for \(p=n\) the side length is \((0.1)^\frac{1}{n}\), which approaches 1 as \(n\) grows.
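These values follow from solving \(s^p=0.1\) for the side length \(s\):

0.1^(1/c(1,2,100))  # 0.100, 0.316, 0.977: the cube nearly fills [0,1]^p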

Q5

(a)

QDA will perform better on the training set; LDA will perform better on the test set, because QDA's extra flexibility fits noise when the true boundary is linear.

(b)

QDA will perform better on both the training and test sets, since it can capture the non-linear boundary.

(c)

As the sample size \(n\) increases, the test prediction accuracy of QDA improves relative to LDA.

Reason: QDA is more flexible than LDA, so its higher variance becomes less of a penalty as the sample size increases, and it can fit the boundary better.

(d)

False. When the Bayes decision boundary is linear, QDA tends to overfit, so its test error rate is generally higher than LDA's, especially when the sample size is small.

Q6

(a)

$$ \begin{aligned} p(Y=A|X_1=40,X_2=3.5)&=\frac{e^{\hat\beta_0+\hat\beta_1X_1+\hat\beta_2X_2}}{1+e^{\hat\beta_0+\hat\beta_1X_1+\hat\beta_2X_2}}\\ &=\frac{e^{-6+0.05\times40+1\times3.5}}{1+e^{-6+0.05\times40+1\times3.5}} \end{aligned} $$

exp(-6+0.05*40+1*3.5)/(1+exp(-6+0.05*40+1*3.5))
## [1] 0.3775407

The probability that this student gets an A is about 0.38.

(b)

\[ X_1=\frac{\log\frac{p(x)}{1-p(x)}-\beta_0-\beta_2X_2}{\beta_1} \]

log(0.5/0.5*exp(6-3.5))/0.05    
## [1] 50

50 hours are needed.

Q7

There are two classes of response values. Here we can use Bayes' theorem: \[ Pr(Y=\text{Yes}|X=x)=\frac{\pi_1 f_1(x)}{\sum_{l=1}^2\pi_l f_l(x)} \] Here \(\pi_1=0.8\), \(\pi_2=0.2\), \(f_1(x)\) is the density of a normal random variable with mean 10 and variance 36, and \(f_2(x)\) is the density of a normal random variable with mean 0 and variance 36.

(0.8*dnorm(4,mean=10,sd=6))/(0.8*dnorm(4,mean=10,sd=6)+0.2*dnorm(4,mean=0,sd=6))
## [1] 0.7518525
Therefore, we get: \[ \begin{aligned} Pr(Y=\text{Yes}|X=4)&=\frac{\pi_1 f_1(4)}{\sum_{l=1}^2\pi_l f_l(4)}\\ &\approx0.75 \end{aligned} \]

The probability that a company will issue a dividend this year given \(X=4\) is about 0.75.

Q8

Q9

(a)

We want to solve for \(p(x)\): \[ \frac{p(x)}{1-p(x)}=0.37 \quad\Rightarrow\quad p(x)=\frac{0.37}{1.37}\approx0.27 \] This means that, on average, about 27% of people with odds of 0.37 of defaulting will in fact default.
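The conversion from odds to probability in R:

odds=0.37
odds/(1+odds)  # about 0.27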

(b)

The odds are \(0.16/(1-0.16)\approx0.19\).
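And the reverse conversion, from probability to odds:

p=0.16
p/(1-p)  # about 0.19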

Applied

Q10

(a)

library(ISLR)
head(Weekly)
##   Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today Direction
## 1 1990  0.816  1.572 -3.936 -0.229 -3.484 0.1549760 -0.270      Down
## 2 1990 -0.270  0.816  1.572 -3.936 -0.229 0.1485740 -2.576      Down
## 3 1990 -2.576 -0.270  0.816  1.572 -3.936 0.1598375  3.514        Up
## 4 1990  3.514 -2.576 -0.270  0.816  1.572 0.1616300  0.712        Up
## 5 1990  0.712  3.514 -2.576 -0.270  0.816 0.1537280  1.178        Up
## 6 1990  1.178  0.712  3.514 -2.576 -0.270 0.1544440 -1.372      Down
summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume       
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202  
##  Median :  0.2380   Median :  0.2340   Median :1.00268  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821  
##      Today          Direction 
##  Min.   :-18.1950   Down:484  
##  1st Qu.: -1.1540   Up  :605  
##  Median :  0.2410             
##  Mean   :  0.1499             
##  3rd Qu.:  1.4050             
##  Max.   : 12.0260
pairs(~Year+Lag1+Lag2+Lag3+Lag4+Lag5+Volume+Today,Weekly)

(b)

glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Weekly,family=binomial)
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Lag2 appears to be statistically significant (\(p\approx0.03\)); none of the other predictors are.

(c)

glm.prob=predict(glm.fit,type='response')
glm.pred=rep('Down',1089)
glm.pred[glm.prob>0.5]='Up'
table(glm.pred,Weekly[,"Direction"])
##         
## glm.pred Down  Up
##     Down   54  48
##     Up    430 557

The overall fraction of correct predictions is \(\frac{54+557}{54+48+430+557} \approx0.56\).

Among weeks predicted Down, the fraction that were actually Up is \(\frac{48}{48+54} \approx0.47\).

Among weeks predicted Up, the fraction that were actually Down is \(\frac{430}{430+557} \approx0.44\). The model predicts Up for the vast majority of weeks, so it catches most true Up weeks but identifies very few true Down weeks; these rates can be read off the confusion matrix directly, as shown below.
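In R, computing the same quantities from the table:

# rows of the confusion matrix are predictions, columns are true labels
t=table(glm.pred,Weekly$Direction)
sum(diag(t))/sum(t)             # overall accuracy, about 0.56
t["Down","Up"]/sum(t["Down",])  # predicted Down but actually Up, about 0.47
t["Up","Down"]/sum(t["Up",])    # predicted Up but actually Down, about 0.44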

(d)

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
train=filter(Weekly,Year>=1990&Year<=2008)
test=filter(Weekly,Year>2008)
glm.fit1=glm(Direction~Lag2,data=train,family=binomial)
glm.prob1=predict(glm.fit1,test,type='response')
glm.pred1=rep('Down',104)
glm.pred1[glm.prob1>0.5]='Up'
table(glm.pred1,test[,"Direction"])
##          
## glm.pred1 Down Up
##      Down    9  5
##      Up     34 56
mean(glm.pred1==test[,"Direction"])
## [1] 0.625

The overall fraction of correct predictions for the held-out data (2009–2010) is \(0.625\).

(e)

library(dplyr)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
lda.fit=lda(Direction~Lag2,train)
lda.pred=predict(lda.fit,test)
lda.class=lda.pred$class
table(lda.class,test[,"Direction"])
##          
## lda.class Down Up
##      Down    9  5
##      Up     34 56
mean(lda.class==test[,"Direction"])
## [1] 0.625

The overall fraction of correct predictions for the held-out data is \(65/104\approx0.625\), identical to logistic regression.

(f)

qda.fit=qda(Direction~Lag2,train)
qda.pred=predict(qda.fit,test)
qda.class=qda.pred$class
table(qda.class,test[,"Direction"])
##          
## qda.class Down Up
##      Down    0  0
##      Up     43 61
mean(qda.class==test[,"Direction"])
## [1] 0.5865385

The overall fraction of correct predictions for the held-out data is \(61/104\approx0.59\); note that QDA predicts Up for every week.

(g)

library(class)
train.x=as.matrix(train$Lag2)
test.x=as.matrix(test$Lag2)
train.y=train$Direction
set.seed(1)
knn.pred=knn(train.x,test.x,train.y,k=1)
table(knn.pred,test$Direction)
##         
## knn.pred Down Up
##     Down   21 30
##     Up     22 31
mean(knn.pred==test[,"Direction"])
## [1] 0.5

The overall fraction of correct predictions for the held-out data is \(0.5\).

(h)

Logistic regression and LDA provide the best results, each with a held-out accuracy of 0.625.
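Collecting the held-out accuracies from the confusion matrices above:

c(logistic=65/104, lda=65/104, qda=61/104, knn1=52/104)
# 0.625 0.625 0.587 0.500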

(i)

Here we conduct experiments with KNN.

k=c(1,3,5,10,15,20,30)
rate=rep(0,7)
a=1
for(i in k){
  knn.pred=knn(train.x,test.x,train.y,k=i)
  table(knn.pred,test$Direction)
  rate[a]=mean(knn.pred==test[,"Direction"])
  a=a+1
}
rate
## [1] 0.5000000 0.5576923 0.5384615 0.5769231 0.5865385 0.5769231 0.5576923

Accuracy peaks at \(K=15\) (\(\approx0.59\)); values of \(K\) between 10 and 30 give the best results here.

Q11

(a)

med=median(Auto$mpg)
mpg01=rep(0,dim(Auto)[1])
mpg01[Auto$mpg>=med]=1
new_Auto=data.frame(Auto,mpg01)

(b)

pairs(new_Auto)

mpg, weight, horsepower, acceleration, and displacement appear most useful in predicting mpg01, as each shows clear separation between the two classes in the scatterplots.

(c)

set.seed(1)
tr <- sample(1:nrow(new_Auto), nrow(new_Auto)*0.7 , replace=F)  # 70% train, 30% test
train <- new_Auto[tr,]
test <- new_Auto[-tr,]

(d)

lda.fit=lda(mpg01~mpg+weight+horsepower+acceleration+displacement,train)
lda.pred=predict(lda.fit,test)
lda.class=lda.pred$class
table(lda.class,test$mpg01)
##          
## lda.class  0  1
##         0 55  1
##         1  6 56
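As in the later parts, the accuracy can also be computed directly:

mean(lda.class==test$mpg01)  # 111/118, about 0.94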

Test error rate is \((1+6)/118\approx0.06\).

(e)

qda.fit=qda(mpg01~mpg+weight+horsepower+acceleration+displacement,train)
qda.pred=predict(qda.fit,test)
qda.class=qda.pred$class
table(qda.class,test$mpg01)
##          
## qda.class  0  1
##         0 55  2
##         1  6 55
mean(qda.class==test$mpg01)
## [1] 0.9322034

Test error rate is \(1-0.932\approx0.07\).

(f)

glm.fit=glm(mpg01~mpg+weight+horsepower+acceleration+displacement,train,family=binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
glm.prob=predict(glm.fit,test,type='response')
glm.pred=rep(0,nrow(test))  # one prediction per test observation
glm.pred[glm.prob>0.5]=1
table(glm.pred,test$mpg01)
##         
## glm.pred  0  1
##        0 60  0
##        1  0 57

Test error rate is 0 on this held-out set.
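The convergence warnings and the perfect accuracy both come from including mpg itself: mpg01 is defined by thresholding mpg, so the two classes are perfectly separable. A sketch of a fairer fit (output not shown) simply drops mpg from the predictors:

# mpg01 is a deterministic function of mpg, hence the perfect separation above
glm.fit2=glm(mpg01~weight+horsepower+acceleration+displacement,
             data=train,family=binomial)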

(g)

library(class)
# dplyr::select is written with the double colon because MASS::select masks it
train.x=dplyr::select(train,mpg,weight,horsepower,acceleration,displacement)
train.y=dplyr::select(train,mpg01)[,1]
test.x=dplyr::select(test,mpg,weight,horsepower,acceleration,displacement)
test.y=dplyr::select(test,mpg01)[,1]
min_error_rate=1  # an error rate can be at most 1
best_k=0
set.seed(1)
for (i in 1:200){
 knn.pred=knn(train.x,test.x,train.y,k=i)
 t=table(knn.pred,test.y)
 error_rate=(t[2]+t[3])/sum(t)  # off-diagonal counts over the test size
 if(error_rate<min_error_rate){
   min_error_rate=error_rate
   best_k=i
 }
}
c(min_error_rate,best_k)

The loop reports the lowest held-out error rate and the value of \(K\) that achieves it.

Q12

(a)

Power<-function(){
  print(2^3)
}

(b)

Power2<-function(x,a){
  print(x^a)
}
Power2(3,8)
## [1] 6561

(c)

a<-data.frame(c(3,8),c(8,17),c(131,3))
for(i in 1:3){
  Power2(a[1,i],a[2,i])
}
## [1] 6561
## [1] 2.2518e+15
## [1] 2248091

(d)

Power3<-function(x,a){
  result=x^a
  return(result)
}

(e)

x=1:10
y=rep(0,10)
for(i in 1:10){
  y[i]=Power3(i,2)
}
plot(x,y,log='x')

plot(x,y,log='y')

plot(x,y,log='xy')
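On the log-log axes the points fall on a straight line with slope 2, since \(\log y=2\log x\).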

(f)

PlotPower<-function(x,a){
  y=rep(0,length(x))
  for(i in seq_along(x)){
    y[i]=Power3(x[i],a)
  }
  plot(x,y)
}
PlotPower(1:10,3)

Q13